Data collection


tl;dr

I compared the sugar and sodium content of the 7 biggest American cereal manufacturers.
+ Made an animated plot with gganimate to compare the 7 biggest American cereal manufacturers.
+ Wrote an User-Defined function that returns the sugar and sodium content.


0. Assignment introduction

POV: It’s early December and you’re sitting in on your living room floor with coffee. Your pet sashays past the stack of books that lay beside you with their food bowl in their mouth, clearly asking your to feed them. You, immersed in your laptop, practicing R, get up and fill their bowl but as you stand in the kitchen you find out there is no human food left in the house. You put on your jacket and leave for the grocery shop. Standing before the breakfast cereal isle you are reluctant to decide which cereal to pick. A voice in your head goes “I could write an R script for this”. And so, you bow your head and leave the shop with enough to get you through the weekend.

During the evening however, when the sound of rain fills your ears and the fur of your pet touches your leg, you find yourself again in the living room, and reach out for your laptop again, and start typing.


## Load required packages
if (!require(rcolorUtrecht)) install.packages("rcolorUtrecht")
if (!require(tidyverse)) install.packages("tidyverse")
if (!require(tidytext)) install.packages("tidytext")    #tidy data of text files
if (!require(ggplot2)) install.packages("ggplot2")      # data visualization with plots
if (!require(magick)) install.packages("magick")        #putting animated graphs side-by-side
if (!require(readxl)) install.packages("readxl")        #imports Excel and CSV files
if (!require(gifski)) install.packages("gifski")        #imports Excel and CSV files
if (!require(knitr)) install.packages("knitr")          #text mining analysis
if (!require(dplyr)) install.packages("dplyr")          #tidy data 
if (!require(here)) install.packages("here")            #informs R about the current working directory
if (!require(car)) install.packages("car")              #Levene statistical test
if (!require(png)) install.packages("png")              
if (!require(DT)) install.packages("DT")                #interactive tables

## installation of devtools is necessary prior to installing the following packages

#devtools::install_github("thomasp85/gganimate")         #makes gif-like graphs
library(gganimate)
#devtools::install_github("thomasp85/patchwork")         #allows for combining graphs
library(patchwork)
#devtools::install_github("ropensci/plotly")             #interactive graphs
library(plotly)

## installation of remotes is necessary prior to installing the following package
#remotes::install_github("wilkelab/ggtext")              #customization of graph text using HTML rendering
library(ggtext)

1. Data import

## Load in the dataset using the tidyverse function read_csv()
cereal <- read.csv(here::here("raw_data/cereal.csv"))


2. Prepping the data

An important step of data prepping is data cleaning and tidying to ensure the following:
+ All variables have each own column (tidy data).
+ All column names are easily understood.
+ The data type of the variables are correct.

The process of prepping data includes data cleaning. Here I clean data by ensuring all variables have a clear enough name, and that all values are stated as words and not initials.

The biggest reason for this is because during the analysis plots will be generated. A graph and what it conveys should be understood all on its own, without having to refer back to the source material time and time again. Moreover, imagine having to look back at your analysis a few months from now and not understanding your own work anymore because you haven’t gathered enough information in one place.

## Change data from dataframe to tibble
cereal_tbl = as_tibble(cereal)

## let's take a look
DT::datatable(cereal_tbl)


3. Statistical testing

Before performing any statistical tests, I first check whether the data distribution is normal.

In frequentest statistics it is necessary to determine whether the data follows a normal distribution in a population. In general frequentest statistics views samples as unknown yet fixed and there is always a fixed average, with some sample being above average and some below.

The first step of frequentest statistics is checking is our data is normally distributed. We assume a null-hypothesis (H0) which always states that the data is in fact NOT normally distributed. When this happens, we are actually forced to say that our data is not trustworthy and stop the analysis (I will not be doing this in this instance). Alternatively, a H1 hypothesis is also taken into consideration and this always states that the values are normally distributed.

The output of a normality test (Shapiro-Wilk) is a number which is called a P-value. When the P-value equals 0.05 or is above 0.05 we say that the data is normally distributed and we accept the H0 hypothesis. When this number is below 0.05 we say data is NOT normally distributed and we accept the H1 hypothesis.


3.1. Testing normal distribution with the Shapiro-Wilk test

## Normality test for SUGAR
p_sugar = shapiro.test(cereal_tbl$sugars_g)

## Normality test for SODIUM
p_sodium = shapiro.test(cereal_tbl$sodium_mg)

## Let's have a look at the P-values
p_sugar
## 
##  Shapiro-Wilk normality test
## 
## data:  cereal_tbl$sugars_g
## W = 0.95247, p-value = 0.005923
p_sodium
## 
##  Shapiro-Wilk normality test
## 
## data:  cereal_tbl$sodium_mg
## W = 0.92904, p-value = 0.0003456


3.2. Checking for viariances with the Levene test

## Check for variance in the SUGAR values
levene_sugar = leveneTest(cereal_tbl$sugars_g, as.factor(cereal_tbl$manufacturer), center = mean)

## Check for variance in the SODIUM values
levene_sodium = leveneTest(cereal_tbl$sodium_mg, as.factor(cereal_tbl$manufacturer), center = mean)

## Let's see
levene_sugar
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value Pr(>F)
## group  6  1.3471 0.2483
##       70
levene_sodium
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value   Pr(>F)   
## group  6  3.1152 0.009188 **
##       70                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


3.3. One-Way ANOVA for the SUGAR (g) content

We’ve now seen that the data wasn’t normally distributed. In most circumstances we would disregard the data altogether after this but in a Bayesian approach to this dataset I have chosen to continue.

After performing a Levene test we saw that the variance within groups wasn’t significant. Which means that the data point between manufacturers aren’t all that different from each other.

With the statistical test One-Way ANOVA test we put the manufacturers side by side and determine whether there is a relationship between the. We have more than 3 groups so we choose the ANOVA test and not a t-test. The outcome of this test is also a P-value however this test determines whether the difference that we find is truly significant. In other words “Do they really differ?”

If the P-value that we find is below 0.05 we again reject the H0 hypothesis and say that there is indeed a significant difference in sugar content when comparing the 7 manufacturers.

A way to build on the ANOVA test is by using the TukeyHSD function which takes into consideration each combination of manufacturer and the adjusted p-value (padj) for that difference.

## Perform a One-Way ANOVA test to test a relationship between manufacturer
## ANOVA test with SUGAR
sugar_anova = cereal_tbl %>% aov(sugars_g ~ manufacturer, data = .) %>% summary

## Is the lack of significance caused by a particular manufacturer?
sugar_tukey = cereal_tbl %>% aov(sugars_g ~ manufacturer, data = .) %>% TukeyHSD()

## let's get some perspective
head(sugar_anova)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## manufacturer  6  262.2   43.69   2.468 0.0319 *
## Residuals    70 1239.4   17.71                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
head(sugar_tukey)
## $manufacturer
##                                                  diff         lwr        upr
## General Mills-American Home Food Products   4.9545455  -8.1061458 18.0152367
## Kelloggs-American Home Food Products        4.5652174  -8.4831234 17.6135581
## Nabisco-American Home Food Products        -1.1666667 -14.9637403 12.6304069
## Post-American Home Food Products            5.7777778  -7.6867874 19.2423429
## Quaker Oats-American Home Food Products     2.2500000 -11.2984573 15.7984573
## Ralston Purina-American Home Food Products  3.1250000 -10.4234573 16.6734573
## Kelloggs-General Mills                     -0.3893281  -4.1986229  3.4199667
## Nabisco-General Mills                      -6.1212121 -12.0043041 -0.2381202
## Post-General Mills                          0.8232323  -4.2310772  5.8775419
## Quaker Oats-General Mills                  -2.7045455  -7.9782753  2.5691844
## Ralston Purina-General Mills               -1.8295455  -7.1032753  3.4441844
## Nabisco-Kelloggs                           -5.7318841 -11.5875062  0.1237381
## Post-Kelloggs                               1.2125604  -3.8097483  6.2348691
## Quaker Oats-Kelloggs                       -2.3152174  -7.5582858  2.9278510
## Ralston Purina-Kelloggs                    -1.4402174  -6.6832858  3.8028510
## Post-Nabisco                                6.9444444   0.2121619 13.6767270
## Quaker Oats-Nabisco                         3.4166667  -3.4818701 10.3152035
## Ralston Purina-Nabisco                      4.2916667  -2.6068701 11.1902035
## Quaker Oats-Post                           -3.5277778  -9.7346356  2.6790801
## Ralston Purina-Post                        -2.6527778  -8.8596356  3.5540801
## Ralston Purina-Quaker Oats                  0.8750000  -5.5118040  7.2618040
##                                                 p adj
## General Mills-American Home Food Products  0.90941564
## Kelloggs-American Home Food Products       0.93692600
## Nabisco-American Home Food Products        0.99997464
## Post-American Home Food Products           0.84845188
## Quaker Oats-American Home Food Products    0.99872162
## Ralston Purina-American Home Food Products 0.99217422
## Kelloggs-General Mills                     0.99992259
## Nabisco-General Mills                      0.03608643
## Post-General Mills                         0.99885466
## Quaker Oats-General Mills                  0.70963334
## Ralston Purina-General Mills               0.93933394
## Nabisco-Kelloggs                           0.05896520
## Post-Kelloggs                              0.99002273
## Quaker Oats-Kelloggs                       0.83041483
## Ralston Purina-Kelloggs                    0.98050196
## Post-Nabisco                               0.03883208
## Quaker Oats-Nabisco                        0.74179283
## Ralston Purina-Nabisco                     0.49462299
## Quaker Oats-Post                           0.60168469
## Ralston Purina-Post                        0.85085020
## Ralston Purina-Quaker Oats                 0.99957403

What we can see is that when putting manufacturers Nabisco-General Mills side by side there is indeed a significant difference in sugar content (per serving given in grams). The same is seen when comparing manufacturers Post-Nabisco.

3.4. One-Way ANOVA for the SODIUM (mg) content

## Perform a One-Way ANOVA test to test a relationship between manufacturer
## ANOVA test with SUGAR
sodium_anova = cereal_tbl %>% aov(sodium_mg ~ manufacturer, data = .) %>% summary

## Is the lack of significance caused by a particular manufacturer?
sodium_tukey = cereal_tbl %>% aov(sodium_mg ~ manufacturer, data = .) %>% TukeyHSD()

## let's get some perspective
head(sodium_anova)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## manufacturer  6 206474   34412   7.352 4.05e-06 ***
## Residuals    70 327643    4681                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
head(sodium_tukey)
## $manufacturer
##                                                   diff          lwr        upr
## General Mills-American Home Food Products   200.454545  -11.9020580 412.811149
## Kelloggs-American Home Food Products        174.782609  -37.3731847 386.938402
## Nabisco-American Home Food Products          37.500000 -186.8296029 261.829603
## Post-American Home Food Products            146.111111  -72.8121647 365.034387
## Quaker Oats-American Home Food Products      92.500000 -127.7872970 312.787297
## Ralston Purina-American Home Food Products  198.125000  -22.1622970 418.412297
## Kelloggs-General Mills                      -25.671937  -87.6080847  36.264211
## Nabisco-General Mills                      -162.954545 -258.6090095 -67.300081
## Post-General Mills                          -54.343434 -136.5225464  27.835678
## Quaker Oats-General Mills                  -107.954545 -193.7012595 -22.207831
## Ralston Purina-General Mills                 -2.329545  -88.0762595  83.417169
## Nabisco-Kelloggs                           -137.282609 -232.4904347 -42.074783
## Post-Kelloggs                               -28.671498 -110.3303005  52.987305
## Quaker Oats-Kelloggs                        -82.282609 -167.5307911   2.965574
## Ralston Purina-Kelloggs                      23.342391  -61.9057911 108.590574
## Post-Nabisco                                108.611111   -0.8505268 218.072749
## Quaker Oats-Nabisco                          55.000000  -57.1648015 167.164801
## Ralston Purina-Nabisco                      160.625000   48.4601985 272.789801
## Quaker Oats-Post                            -53.611111 -154.5297548  47.307533
## Ralston Purina-Post                          52.013889  -48.9047548 152.932533
## Ralston Purina-Quaker Oats                  105.625000    1.7805723 209.469428
##                                                   p adj
## General Mills-American Home Food Products  7.679218e-02
## Kelloggs-American Home Food Products       1.748896e-01
## Nabisco-American Home Food Products        9.986733e-01
## Post-American Home Food Products           4.079992e-01
## Quaker Oats-American Home Food Products    8.610795e-01
## Ralston Purina-American Home Food Products 1.058934e-01
## Kelloggs-General Mills                     8.682295e-01
## Nabisco-General Mills                      4.220903e-05
## Post-General Mills                         4.193706e-01
## Quaker Oats-General Mills                  5.053407e-03
## Ralston Purina-General Mills               1.000000e+00
## Nabisco-Kelloggs                           7.831643e-04
## Post-Kelloggs                              9.358931e-01
## Quaker Oats-Kelloggs                       6.546524e-02
## Ralston Purina-Kelloggs                    9.808155e-01
## Post-Nabisco                               5.314904e-02
## Quaker Oats-Nabisco                        7.505691e-01
## Ralston Purina-Nabisco                     8.698145e-04
## Quaker Oats-Post                           6.747117e-01
## Ralston Purina-Post                        7.048225e-01
## Ralston Purina-Quaker Oats                 4.361868e-02

The same statistical test is done to test for a significant difference in sodium content (given in milligrams per serving) per manufacturer. Surprisingly, every single combination of manufacturers shows a significant difference of sodium content.


4. Data visualization

Now that we asked the question is there a significant difference between the 7 manufacturers (turns out the answer is yes), we can calculate the mean of the sugar content and sodium content and plot them.

I do this by using the ggplot2 package as well as gganimate to turn the plots into a gif.



4.1. Graphing the sugar content of 80 cereals

sugar_80 <- cereal_tbl %>% 
            group_by(manufacturer) %>% 
            ggplot(aes(x = cereal_name,
                       y = sugars_g,
                       fill = cereal_name)) +
            geom_bar(stat = "identity") +
            coord_flip() +
            labs(title = "Comparing the sugar content of 80 American cereals",
                  subtitle = "Sugar content is given per serving in grams",
                  x = "Sugar (g)",
                  y = "Cereal names") +
            coord_flip() +
            theme_minimal() +
  scale_fill_rcolorUtrecht(palette = "microscope")
  theme(legend.position = "none")
## List of 1
##  $ legend.position: chr "none"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE
## make the graph interactive
ggplotly(sugar_80)

Graph 3: Bar chart of the 80 cereals and their sugar content (g)

4.2. Graphing the sodium content of 80 cereals

sodium_80 <- cereal_tbl %>% 
             group_by(manufacturer) %>% 
             ggplot(aes(x = cereal_name,
             y = sodium_mg,
             fill = cereal_name)) +
             geom_bar(stat = "identity") +
             labs(title = "Comparing the sodium content of 80 American cereals",
                  subtitle = "Sodium content is given per serving in milligrams",
                  x = "Sodium (mg)",
                  y = "Cereal names") +
             coord_flip() +
             theme_minimal() +
  scale_fill_rcolorUtrecht(palette = "microscope")+
  theme(legend.position = "none")

## if you're brave you can make a different plot for every manufacturers and their cereal brands
             facet_wrap(~manufacturer)
## <ggproto object: Class FacetWrap, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetWrap, Facet, gg>
## make the graph interactive                          
ggplotly(sodium_80)

Graph 4: Bar chart of the 80 cereals and the sodium content (mg)

4.3. Combine the gif graphs

Huge thanks to Connor Rothschild from {Rothschild (n.d.)}
Gif graphs, or animated graphs are pretty neat however just like any other type of graph we want to compare the two to get a clearer picture of what we’re testing. The steps prior have been spent determining what the mean values of sugar and sodium is in cereal, depending on the manufacturer. Those mean values have been plotted, but having to scroll up and down to see them isn’t essential, so below I put them side by side.

##turn the sugar plot into a gif
sugar_gif <- animate(sugar,
                 fps = 10,
                 duration = 25,
        width = 550, height = 500,
        renderer = gifski_renderer(here::here("images", "sugar.gif")))

##turn the sodium plot into a gif
sodium_gif <- animate(sodium,
                 fps = 10,
                 duration = 25,
        width = 550, 
        height = 500,
        renderer = gifski_renderer(here::here("images", "sodium.gif")))

##a gif is effectively may pictures of an image shown quickly
##store each frame of the plot in an object with image_read
sugar_mgif <- image_read(sugar_gif)
sodium_mgif <- image_read(sodium_gif)

##put both gif plots side by side
new_gif <- image_append(c(sugar_mgif[1], sodium_mgif[1]), stack = FALSE)

for(i in 2:90){
  combined <- image_append(c(sugar_mgif[i], sodium_mgif[i]), stack = FALSE)
  new_gif <- c(new_gif, combined)
}

##let's have a look
new_gif

Graph 5: Mean sugar and sodium content of the 7 biggest cereal manufacturers


5. User-defined function that returns the sugar content of a cereal

I really love R. Namely because you can see the results of your code right away in the form of a graph, however when your dealing with a dataset that has many variables a simple graph won’t always cut it. While there are few manufacturers in this dataset, those manufacturers make a lot of types of cereal: 80 to be precise.

A graph with 80 bars just isn’t always to read and we have already established a great deal of information with the graphs displayed above.

What exactly? Well, that on average manufacturers Post and General Mills produce cereals that contain the highest content of sugars per servings, while Post and Ralston Purina produce cereals with the highest sodium content.

Let’s say we want to dive in a little deeper (I know I do) and say, while comparing the biggest manufacturers on the market is great but how about we pull out of magnifying glass and comparing all 80 cereals without a plot.

A method of doing this is by making a custom function within R that asks us the name of the cereal and when we answer it returns the sugar and sodium content. The way to do this is by telling R your going to filter a column, your going to give R a name and that R needs to return the value of that column {Rodrigues (2020)}.

Let me show you below how I do that.

N.B. The dataset is made up of 80 cereals. If we give a cereal name that isn’t in the list we will not get an answer.

## With the heartiest thanks to Bruno Rodrigues whom's bookdown "Modern R with the tidyverse"
## has helped with making this function

##function name
nutrition <- function(dataset, col_name, value){
             col_name <- enquo(col_name)
             dataset %>%
               
## My goal is to make a function that will return all nutritional values when I give a cereal name so,
## filter the column that contains the cereal names
             filter((!!col_name) == value) 

## If getting the entire list of nutrients isn't your thing then,
## summarize to get the mean sugar and sodium value
  #%>%
   # summarise(mean_sugar = mean(sugars_g),
    #          mean_sodium = mean(sodium_mg))
}


## Our function name is nutrition
## Give the function the dataset name, column name and lastly the name of the cereal
nutrition(cereal_tbl, cereal_name, "Apple Jacks")
## # A tibble: 1 × 15
##   cereal_name manufacturer type   kcal protein_g fat_g sodium_mg fiber_g carbs_g
##   <chr>       <chr>        <chr> <int>     <int> <int>     <int>   <dbl>   <dbl>
## 1 Apple Jacks Kelloggs     Cold    110         2     0       125       1      11
## # … with 6 more variables: sugars_g <int>, potass_mg <int>, vit_perc <int>,
## #   shelf <int>, weight_oz <dbl>, cups <dbl>

In summation

  • Manufacturers Post and General Mills produce cereals with the highest sugar content per serving.
  • Manufacturers General Mills and Ralston Purina produce cereals with the highest sodium content per serving.
  • Cereal Smacks and Golden Crisp contain the highest sugar content with 15 grams per serving, followed by Apple Jacks which contains 14 grams sugar per serving.
  • Cereal brand Product 19 contains the highest amount of sodium with 310 milligrams per serving, followed by Rice Krispies, Cheerios and Corn Flakes which contains 290 milligrams per serving.

How serious should we take these finding?
At the beginning of this analysis we found that the Shapiro-Wilk test determined that the data was not normally distributed which was not very surprising considering the relatively small number of values we are dealing with. However this is still a small set of cereals that we’re analyzing and seeing as it’s unclear when the dataset was last updated, I would suggest we take these findings with a pinch of sodium(chloride).

Aesthetics

<!--Color of the content bar-->
.list-group-item.active, .list-group-item.active:hover, .list-group-item.active:focus {
    z-index: 2;
    color: #ffffff;
    background-color: #edc9af;
    border-color: #edc9af;
}

.list-group-item.active, .list-group-item.active, .list-group-item.active:focus {
    z-index: 2;
    color: #ffffff;
    background-color: #edc9af;
    border-color: #edc9afs;
}

body {
    font-family: 'Playfair Display', serif;
    font-size: 15px;
    line-height: 1.42857143;
    color: #777777;
    background-color: #ffffff;
}
<!--Color of the h1 header-->
.columns {display: flex;}
h1 {color: #000000;}
.p {background-color: #edc9af;}

Sources I used

  • All lessons from my minor course
Rodrigues, Bruno. 2020. “Modern r with the Tidyverse.” Chapter 7 Defining Your Own Functions. https://b-rodrigues.github.io/modern_R/defining-your-own-functions.html.
Rothschild, Connor. n.d. “How to Combine Animated Plots in r.” Connor Rothschild. https://www.connorrothschild.com/post/tidy-tuesday-powerlifting.